Ph. 

<: 

2 

■<-* 

C3 

•*— » 

> 

O 

o 
m 

o 

oo 
o 



% 



Generalized SURE for Exponential Families: 
Applications to Regularization 



Yonina C. Eldar 



Abstract 



oo ; 

Stein's unbiased risk estimate (SURE) was proposed by Stein for the independent, identically distributed (iid) 

(N 



Gaussian model in order to derive estimates that dominate least-squares (LS). In recent years, the SURE criterion 
has been employed in a variety of denoising problems for choosing regularization parameters that minimize an 
estimate of the mean-squared error (MSE). However, its use has been limited to the iid case which precludes many 
important applications. In this paper we begin by deriving a SURE counterpart for general, not necessarily iid 
distributions from the exponential family. This enables extending the SURE design technique to a much broader 
class of problems. Based on this generalization we suggest a new method for choosing regularization parameters 
in penalized LS estimators. We then demonstrate its superior performance over the conventional generalized cross 
validation approach and the discrepancy method in the context of image deblurring and deconvolution. The SURE 
technique can also be used to design estimates without predefining their structure. However, allowing for too many 
free parameters impairs the performance of the resulting estimates. To address this inherent tradeoff we propose a 
regularized SURE objective. Based on this design criterion, we derive a wavelet denoising strategy that is similar in 
sprit to the standard soft-threshold approach but can lead to improved MSE performance. 

I. Introduction 

Estimation in multivariate problems is a fundamental topic in statistical signal processing. One of the most 
common recovery strategies for deterministic unknown parameters is the well-known maximum likelihood 
(ML) method. The ML estimator enjoys several appealing properties, including asymptotic efficiency under 
suitable regularity conditions. Nonetheless, its mean-squared error (MSE) can be improved upon in the 
non-asymptotic regime in many different settings. 
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In their seminal work, Stein and James showed that for the independent, identically-distributed (iid) linear 
Gaussian model, it is possible to construct a nonlinear estimator with lower MSE than that of ML for all 
values of the unknowns [1], [2]. Various modifications of the James-Stein method have since been developed 
that are applicable to the non-iid Gaussian case as well [3], [3], [5], [6], [7], [8]. 

The James-Stein approach is based on the Stein unbiased risk estimate (SURE) [9], [10], which is an unbiased 
estimate of the MSE. Since the MSE in general depends on the true unknown parameter values it cannot 
be used as a design objective. However, using the SURE principle leads to a relatively simple technique for 
determining methods that have lower MSE than ML. The idea is to choose a class of estimates, and then 
select the member from the class that minimizes the SURE estimate of the MSE. This strategy has been 
applied to a variety of different denoising techniques [TT], p2], [13], [H]. Typically, in these problems, implicit 
prior information on the signal to be recovered is incorporated into the chosen structure of the estimate. For 
example, in wavelet denoising the signal is assumed to be sparse in the wavelet domain which motives the use 
of thresholding. Only the value of the threshold is determined by the SURE principle. 

The SURE method is appealing as it allows to directly approximate the MSE of an estimate from the 
data, without requiring knowledge of the true parameter values. However, it has two main drawbacks which 
severely limit its use in practical applications. The first restriction is that it was originally limited to the 
iid Gaussian case. Several extensions have been developed for different independent models. In particular, a 
SURE principle for iid, infinitely divisible random variables with finite variance is derived in [15]. Extensions 
to independent variables from a continuous exponential family are treated in [16], [17] . while the discrete 
exponential case is discussed in [18J. All of these generalizations are confined to the independent case which 
precludes a variety of important applications such as image deblurring. 

The second drawback of using SURE as a design criterion is that in order to get meaningful estimators 
the basic structure of the estimate must be determined in advance. If no parametrization is assumed, then 
there are too many free variables to be optimized, and the SURE method will typically not lead to good MSE 
behavior. 

In this paper we extend the basic SURE principle in two directions, in order to circumvent the two fundamen- 
tal drawbacks outlined above. First, we generalize SURE to multivariate, possibly non-iid exponential families. 



In particular, we develop an unbiased estimate of the MSE for a general Gaussian vector model. Exponential 
probability density functions (pdfs) play an important role in statistics due to the Pitman-Koopman-Darmois 
theorem |19| . |20| . [21] . which states that among distributions whose domain does not vary with the parame- 
ter being estimated, only in exponential families is there a sufficient statistic with bounded dimension as the 
sample size increases |22| . Furthermore, efficient estimators exist only when the underlying model is expo- 
nential. Many known distributions are of the exponential form, such as Gaussian, gamma, chi-square, beta, 
Dirichlet, Bernoulli, binomial, multinomial, Poisson, and geometric distributions. Our result has important 
practical value as it extends the applicability of the SURE technique to more general estimation models, and 
in particular to scenarios in which the observations are dependent. This is the situation, for example, when 
using overcomplete wavelet transforms, and in image deblurring. In addition, we derive results for the case in 
which the model is rank-deficient so that the pdf depends only on a projection of the parameter vector. 

An immediate application of this extension is to the general linear Gaussian model. In this setup, we seek 
to estimate a parameter vector 6 from noisy, blurred measurements x = H# + w where w is a Gaussian 
noise vector and H may be rank deficient. One of the most popular recovery strategies in this context is the 
regularized least-squares (LS) method. In this approach, the estimate is designed to minimize a regularized 
LS objective where typical choices of penalization are weighted £2 or £\ norms. An important aspect of this 
technique, which significantly impacts its performance, is selecting the regularization parameter. A variety 
of different methods have been proposed for this purpose [21], [25], [26], [27], [28], [29], [30], [HI]. When an 
£2 norm is used to penalize the LS solution, the resulting estimate is linear and is referred to as Tikhonov 
regularization. One of the most popular regularization selection methods in this context is generalized cross- 
validation (GCV) [32]. When an £\ penalty is used, the resulting estimate is nonlinear so that applying the 
GCV approach is more complicated. An alternative choice is the discrepancy method in which the parameter 
is chosen such that the resulting data error is equal to the noise variance. 

Here, we suggest an alternative strategy based on our extended SURE criterion. Specifically, we use SURE 
to evaluate the MSE of the penalized solution for any choice of regularization, and then select the value 
that minimizes the SURE estimate. This allows SURE-based optimization of a broad class of deblurring and 
deconvolution methods including both linear and nonlinear techniques. A similar approach was studied in 



the special case of Tikhonov regularization with white noise in [24], [25] . [26] . However, our technique is not 
limited to an £2 penalty and can be used with any other penalized LS method. When the estimate is not 
given explicitly but rather as a solution of an optimization problem we can still employ the SURE strategy 
by using a Monte-Carlo approach to approximate the derivative of the estimate, which figures in the SURE 
expression [39]. Using several test images and a deconvolution problem, we demonstrate that this strategy 
often leads to significant performance improvement over the standard GCV and discrepancy selection criteria 
in the context of image deblurring and deconvolution. 

Finally, to circumvent the need for pre-defining a particular structure when applying SURE, we propose an 
alternative approach based on regularizing the SURE objective. Specifically, we suggest adding a penalization 
term to the SURE expression and choosing an estimate that minimizes the regularized function. In this way, 
we can control the properties of the estimate without having to a-priori assume a specific structure. We then 
illustrate this strategy in the context of wavelet denoising. Instead of assuming a threshold estimate and 
choosing the threshold to minimize the SURE criterion, as in [llj . we design an estimate that minimizes an 
£\ regularized SURE objective. The resulting denoising scheme has the form of a threshold with a particular 
form of shrinkage, that is different than that obtained when using soft or hard thresholding. To evaluate our 
method, we compare it with SureShrink of Donoho and Johnstone [11] , by repeating the simulations reported 
in their paper. As we show, the recovery results tend to be better using our technique. Moreover, our approach 
is general as it is not tailored to a specific problem. We thus believe that using a regularized SURE principle 
together with the generalized SURE developed here can extend the applicability of SURE-based estimators 
to a broad class of problems. 

The remaining of the paper is organized as follows. In Section |TI] we introduce the basic concept of MSE 
estimation. An extension of SURE to exponential families is developed in Section IIIII In Section IIVI we 
discuss rank-deficient models in which the pdf of the data depends only on a projection of the unknown 
parameter vector. We then specialize the results to the linear Gaussian model in Section |V1 Applications 
to regularization selection are discussed in Section IVII The regularized SURE criterion, together with an 
application to wavelet denoising, are developed in Section IVHI 



II. MSE Estimation 

We denote vectors by boldface lowercase letters, e.g., x, and matrices by boldface uppercase letters e.g., 
A. The ith component of a vector y is written as yi, and (•) is an estimated vector. The identity matrix is 
written as I, A is the transpose of A, and A^ denotes the pseudo-inverse. For a length-m vector function 

h(u) £R m ofuGR™, 

f dh(u) \ _A 9/ti(u) 

V du J f-T dm ' K ' 

We consider the class of problems in which our goal is to estimate a deterministic parameter vector 6 
from observations x which are related through a pdf /(x; 0). We further assume that the pdf belongs to the 
exponential family of distributions and can be expressed in the form 

/(x; 6) = r(x) exp{0 T 0(x) - g(0)}, (2) 

where r(x) and 0(x) are functions of the data only, and g(6) depends on the unknown parameter 9. 

As an example of an application where the model ([2|) can occur, consider the location problem of estimating 
a parameter vector 6 S M m from observations x G R n related through the linear model: 

x = H0 + w, (3) 

where w is a zero-mean Gaussian random vector with covariance C y 0. The pdf of x is then given by ([2]) 
with 

r(x) = i exp{-(l/2)x T C- 1 x}; 

V ' ^/(27r)"dct(C) Fl V ' ; J ' 

(/.(x) = H T C -1 x; 
5 (0) = (1/2)0 T H T C- 1 H0. (4) 

Other examples of distributions in the exponential family include Poisson with unknown mean, exponential 
with unknown mean, gamma, and Bernoulli or binomial with unknown success probabilities. 



Given the model (|2]), a sufficient statistic for estimating 9 is given by 

u = 0(x). (5) 

Therefore, any reasonable estimate of x will be a function only of u. More specifically, from the Rao-Blackwell 
theorem |33| it follows that if 9 is an estimate of 9 which is not a function only of u, then the estimate i?{0|u} 
has lesser or equal MSE than that of 9, for all 9. Therefore, in the sequel, we only consider methods that 
depend on the data via u. 

Let 9 be an arbitrary estimate of 9, which we would like to design to minimize the MSE, defined by 
E{\\9 — 9\\ 2 }. In practice, 9 = h(u) where h(u) is some function of u that is typically chosen to have 
a particular structure, parameterized by a vector a. For example, h(u) = cm where a is a scalar, or 
hi(u) = Tp a (ui) where 

i> a (u) = sign(n)[|-u| - a]+ (6) 

is a soft-threshold with cut-off a. Ideally, we would like to select a to minimize the MSE. Since this is 
impossible, as we show below, instead in the SURE approach ex is designed to minimize an unbiased estimate 
(referred to as the SURE estimate) of the MSE. 
We can express the MSE of 9 = h(u) as 

E\\\9-9f\ = ||0|| 2 + £{||h(u)|| 2 }-2£{h T (u)0}. (7) 

In order to minimize the MSE over h(u) we need to explicitly evaluate the expression 

W (h,0) = £{||h(u)|| 2 }-2£{h T (u)0}. (8) 

Evidently, the MSE will depend in general on 9, which is unknown, and therefore cannot be minimized. 
Instead, we may seek an unbiased estimate of v(h, 9) and then choose h to minimize this estimate. The difficult 
expression to approximate is E{h T (u)9} as the dependency on 9 is explicit. Therefore, we concentrate on 
estimating this term. If this can be done, then we can easily obtain an unbiased MSE estimate. Specifically, 



suppose we construct a function g(h(u)) that depends only on u (and not on 6), such that 

E{g(h(u))} = E{h T (u)6}^E hfi . (9) 

Then 

t)(h) = ||h(u)|| 2 -2 5 (h(u)), (10) 

is an unbiased estimate of v(h, 9), since clearly E{v(h)} = v(h, 0). A reasonable strategy therefore is to select 
h(u) to minimize our assessment ■O(h) of the MSE. This approach was first proposed by Stein in [9], [10] for 
the iid Gaussian model (|3]) with C = I and H = I. 

The design framework proposed above reduces to finding an unbiased estimate of E h q . Clearly, any such 
approximation will depend on the pdf /(x; 0). In the next section we develop an unbiased estimate when the 
pdf is given by the exponential model ([2]). Before addressing the general setting, to ease the exposition we 
illustrate the main idea proposed by Stein, by first considering the simpler iid Gaussian case. In this setting 
we seek to estimate a vector 6 £ M m from measurements x = + w, where w is a zero- mean Gaussian random 
vector with iid components of variance a 2 . In Section IIVI we treat the more difficult case in which u lies in a 
subspace A of M. m , and the pdf ^) depends on only through its orthogonal projection onto A. This situation 
arises, for example, in the linear Gaussian model (J3j) when H is rank deficient. For this setup, we develop a 
SURE estimate of the MSE in estimating the projected parameter. In Sections IVl and IVT1 we consider several 
examples of estimates in which the free parameters are chosen to minimize the SURE objective. In particular, 
we propose alternatives to the popular GCV and discrepancy methods for regularization. In Section I VIII 
we suggest a regularized SURE strategy for determining h(u) without the need for parametrization, and 
demonstrate its performance in the context of wavelet denoising. 



III. Extended SURE Principle 

A. IID Gaussian Model 

We begin our development by treating the iid Gaussian setting. Since from (JH), u = (l/cr 2 )x, we consider 
estimates = h(x) that are a function of x. 

To develop an unbiased estimate of E h q, we exploit the fact that for the Gaussian pdf /(x; 6) 

( Xl -e l )f( X] e) = -a 2 ^^l. (ii) 

Assuming that £?{|/^(x)|} is bounded and /^(x) is weakly differentiablqj in x, we have that 

E h,6 = E / ^(x)^/(x;0)dx 

= E /°° w*) (W( x ; °) + ^^p) dx 

= E{h T ( X )x} + a*JT T kW^^-dx, (12) 

■ i J — CO "^i 



i=l 



where the second equality is a result of (|11|) . To evaluate the second term in (|12p . we use integration by parts: 

f°° hi(x) df{ *' ,e) dx = - r ^(x)/ 4 (x;0)dx = -Eih^)}, (13) 

J— oo OXi J— co 

were we denoted h'^x) = <9/ij(x)/<9xj, and used the fact that |/ij(x)/(x; 0)| — ► for |xj| — > oo since E^ | /ij (x) | } 
is bounded. We conclude from (fT2l) and (fT3l) that 



^h,0 = -v 2 E *^( x )} + ^ hT ( x ) x } 5 (14) 



and therefore, 



i=l 



2V ^ <9/i;(x) T 



1 Roughly speaking, a function is weakly differentiable if it has a derivative almost everywhere, as long as the points that are 
not differentiable are not delta functions; see 134] for a more formal definition. 



is an unbiased estimate of E h q . 

B. Extended SURE 

We now extend the basic approach outlined in the previous section to the general class of exponential pdfs. 
In order to address this model, we only consider methods that depend on the data via u. This enables the 
use of integration by parts, similar to the iid Gaussian setting. 

The following theorem provides an unbiased estimate of E {h T (u)#} which depends only on u and not on 
the unknown parameters 6. 

Theorem 1: Let x denote a random vector with exponential pdf given by ([2]), and let u = </>(x) be a sufficient 
statistic for estimating 6 from x. Let h(u) be an arbitrary function of 9 that is weakly differentiable in u and 
such that i?{|/ij(u)|} is bounded. Then 

B{ h>).>--*{ tt (^)}- B {l^.)^}, (16) 



where 



g(u) = y"r(x)<5(u-0(x))dx, (17) 



and 5(x) is the Kronecker delta function. 

Note, that as we show in the proof of the theorem, the pdf /(u; 0) of u is given by 



f(u;O)=q(u)exp{0 T u-g(0)}. (18) 



Therefore, an alternative to computing q(u) using (|17p is to evaluate the pdf of u and then use ([18 
From the theorem, it follows that 



,'<9h(u)\ , T / ^<91n(/(u) 
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is an unbiased estimate of E{h (u)G}. In the hd Gaussian case, u = (1/cj 2 )x so that 



gKu) 2 gh ( x ) , 9n N 

"a = a ~~ a • ( 20 ) 

au ox 



Furthermore, u is a Gaussian hd vector with elements that have mean (l/a 2 )6 and variance 1/cr 2 so that q(u), 
which is the normalization factor, is given by q(u) = if exp{— ||u|| 2 <7 2 /2} for a constant K. Consequently 

dlnq(u) 9 , v 

* = ~ a u = " x - 21 

ou 

Substituting (j20j) and (|2~T|) into (fl~9|) . the estimate reduces to (fT5|) . derived in the hd Gaussian setting. 

Proof: To prove the theorem we hrst determine the pdf of u. Since u = </>(x) we have that [331 P- 127] 



/(u; 0) = y /(x; 0)«5(u - 0(x))dx. (22) 



Substituting © into (g2]), 



/(u;0) = exp{0 r u- 5 (0)}yr(x)5(u-<A(x))dx 

= g (u)exp{0 T u- 5 (0)}. (23) 



Now, 



Noting that 



E{h T {u)9} = 

= fh T {u)0exp{0 T u-g(0)}q(u)du 

m „ 

= Y, I hi(u)9iexp{0 T u-g(e)}q(u)du. (24) 
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we have 



hi(u)0i exp{0 u — g(0)}q(u)du 

oo 



f°° , , w ,dexp{0 J u-g(0)} , 
= / /ii(u)g(u) ^-jj y -±^-dui 

= -f^^exp^u-^)}^, (26) 

where we used the fact that \hi (11)5(11) exp{0 u — g(6)}\ — ► for \m\ — ► 00 since i£{|/ii(u)|} is bounded. Now, 

dhj(u)q(u) dhjju) dq(u) 

dUi = -^r q{u) + -^r hi{u) - (27) 

Substituting ([26]) and fl27|) into flMD, 

£{h T (u)0} = 



-E/^^exp {9 -u- 9 ( 9 )}9u 

VV ^/ ^( u ) \ _£ /9g(u)/ii(u) 



8=1 



dui J [ 9uj g(u) 



- t i lf| ^))U{h»^r 



which completes the proof of the theorem. ■ 

Based on Theorem [1] we can develop a generalized SURE principle for estimating an unknown parameter 
vector in an exponential model. Specifically, let = h(u) be an arbitrary estimate of based on the data 
x, where h(u) satisfies the regularity conditions of Theorem [H Combining ([7]) and Theorem [TJ an unbiased 
estimate of the MSE of is given by 



12 : MU/..MI2 , n^J dh ( U )\ , ouT/„^ln?(u) 



S(h) = \\G\\ 2 + ||h(u)|| 2 + 2Tr ?-p- + 2h J (11) " ^ v ' . (29) 



We may then design 6 by choosing h(u) to minimize 5(h). 
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In the iid Gaussian case, (129j) reduces to 



|2 , ,|U/„M|2 , o,29h(x) r 



|0|| 2 + ||h(x)|| 2 + 2a ■ 2 —^ L ~ 2h J (x)x (30) 



where we used the fact that u = (l/cr 2 )x which implies dlnq(u)/du = — x, and dh(u)/d\i = a 2 <9h(x) / <9x. 
The MSE estimate (f30"j) was first proposed by Steiro in [9], [TO] . 

The SURE estimate can be used to determine unknown regularization parameters which comprise a given 
estimation strategy. An example is the SureShrink approach to wavelet denoising [TJJ . Extending this tech- 
nique, our general SURE objective (|29|) may be used to select regularization parameters in more general 
models. We discuss these ideas in the context of linear Gaussian problems in Section IVl 

IV. Rank-Deficient Models 

In some settings, the sufficient statistic u lies in a subspace A of M m . As an example, suppose that in the 
Gaussian model (J3j) H is rank-deficient. In this case u = H T C _1 x lies in the range space 1Z(H. T ) of H T , 
which is a subspace of M m . If 6 is not restricted to a subspace, then we do not expect to be able to reliably 
estimate 9 from u, unless some additional information on is known. Nonetheless, we may still obtain a 
reliable assessment of the part of 6 that lies in A. Denote by P the orthogonal projection onto A. Then, we 
show below, that if u depends on 6 only through P0, and in addition u has an exponential pdf, then we can 
obtain a SURE estimate of the error in A, namely .E{||P<? — P#|| 2 }. If in addition 6 lies in A, then up to a 
constant, independent of 0, this approximation is also an unbiased estimate of the true MSE i£{||0 — #|| 2 }- 

To derive the SURE estimate in this case, we first note that if u lies in A, then 

6 T u = (P0) T (Pu). (31) 

Suppose that A has dimension r < m. Since PO lies in an r-dimensional space, it can be expressed in terms 
of r components in an appropriate basis. Denoting by V an m x r matrix with orthonormal vectors that span 

2 We note that the expression obtained by Stein is slightly different since instead of optimizing = h(x), he considered estimates 
of the form 6 — x + h(x) and then optimized h(x). The two expressions differ by a constant, which does not effect the optimization 
ofh(x). 
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A = 1Z(P), the vector P9 can be expressed as P9 = V0 for an appropriate length-r vector 9'. Similarly, 
Pu = Vu'. Therefore, we can write 

e T u = e ,T u', (32) 

where we used the fact that V T V = I. We assume that u' is a sufficient statistic for 9' and that /(u'; 9') has 
an exponential pdf: 

/(u' ; 9') = g(u') exp{0'V - g(9')}. (33) 

Under this assumption, we next derive a SURE estimate of the MSE in A. 
The MSE in estimating 9 can then be written as 

E{\\9 - 9\\ 2 } = E{\\P9 - P9\\ 2 } + E{\\(1 - P)9 - (I - P)0|| 2 }. (34) 

If 9 lies in A, then (I — P)9 = and the second term is constant, independent of 9. Therefore, in this case, 
to optimize 9 it is sufficient to obtain an unbiased estimate of the first term. As we show below, such an 
assessment can be derived using similar ideas to those in Theorem [TJ Even if 9 does not lie in .4, the SURE 
estimate we develop may be used to approximate the first term. Since u depends only on P9, it is reasonable 
to restrict attention to estimates 9 = ho;(u) where the parameters a are tuned to minimize the MSE in A 
E{\\P9 — P0|| 2 }, subject to any other prior information we may have, such as norm constraints on 9. In such 
cases we can use a regularized SURE criterion with the SURE objective being the MSE in A, as we discuss 
in Section IVIII 

We now develop a SURE estimate of the MSE E'HIP^ — P#|| 2 }. To this end we need to find an unbiased 
estimate of 

E{9 T P9} = E{9 T V9'} = E{(V T 9) T 9'} (35) 

Let 9 = h(u) = h(u') (since u = Vu', it is clear that 9 is a function of u'). By our assumption, 9' has an 
exponential pdf with sufficient statistic u'. Therefore, we can apply Theorem [1] to V r h(u) for any function 



1-1 



h(u) that obeys the conditions of the theorem, which yields 



E{ h»v«'} = - E { Tr (v«)}- E {h>0v^M 



■iT'v|P^)}-^{tf-(«)v5^1j. 



where we used the fact that P = VV and 

dh(u') dh(u} 



du' du 



V. (37) 



We conclude that if h(u) is an estimate of a parameter vector 0, where u lies in a subspace A and is a 
sufficient statistic for estimating P#, with P denoting the orthogonal projection onto ^4, then an unbiased 
estimate of the MSE £{||Ph(u) - P0|| 2 } is given by 

S(h) = ||P0f + ||Ph(u)f + 2Tr (P^P) + 2h^u)V^01, (38) 

\ ou J ou 

with u = Vu' and V denoting an orthonormal basis for A. When A = W 11 , P = I, V = I and (|38p reduces to 

V. Linear Gaussian Model 

We now specialize the SURE principle to the linear Gaussian model ©. We begin by treating the case in 
which H is an n x m matrix with n> m and full column rank. We then discuss the rank-deficient scenario. 

A. Full-Rank Model 

To use Theorem Q] we need to compute the pdf q(u) of u. Since u = H C _1 x, it is a Gaussian random 
vector with mean H C _1 H0 and covariance H C _1 H. As q(u) is the function multiplying the exponential 
in the pdf of u, it follows from §%§ that 

g(u) =i^exp{-(l/2)u T (H T C- 1 H)- 1 u}, (39) 



1", 



where K is a constant, independent of u. Therefore, 



01ng(u) _ ,„/,-- u-.-i 



5 



u 



(H 1 C"^)-^ = -Oml, (40) 



where 0ml is the ML estimate of given by 



M L = (H i C- 1 H)- i H J C- i x. (41) 



It then follows from Theorem Q] that 



E {h T (u)6} = -E JTr (^) - h r (u)0 ML } • (42) 



Using (|29p and (|40p we conclude that 



,:> , HU/..MI2 , o(^f' 5h (u)\ , r . 



5(h) = ||0|| 2 + ||h(u)r + 2 ( Tr ( -±-± ) - h 1 (u)e ML ) , (43) 



is an unbiased estimate of the MSE. 

B. Rank-Deficient Model 

Next, we consider the linear Gaussian model (j3]) with a rank-deficient H. 

Suppose that H has a singular value decomposition H = USQ T for some unitary matrices U and Q. Let 
H have rank r so that S is a diagonal n x m matrix what the first r diagonal elements equal to o~i > 
and the remaining elements equal 0. In this case, V is equal to the first r columns of Q and 0' = V 0. A 
sufficient statistic for estimating 6' is u' = V H C x. Indeed, u' is a Gaussian random vector with mean 
H' = V T H T C- 1 H0 and covariance C = V T H T C ! HV. Using the SVD of H we have that 

C = A[U T C _1 U] r , (44) 
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where A is an r x r diagonal matrix with diagonal elements erf > and [A] r is the r x r top- left principle 
block of size r of the matrix A. Since C >- 0, C' is invertible. Therefore, 

/(u'; 6') = g(u') exp{0'u' - g{e')} (45) 



with 



q(v!) = -. I exp{-(l/2)u /T C / ~ 1 u / }; 

yV ' ^/(2tt)" dot(C') ^ l V ' ' J ' 

5 (0') = (l/2)6> /T A[U T C- 1 U] r 6> / . (46) 

We therefore conclude from (|38H that an unbiased estimate of the MSE £'{||Ph(u) — P#|| 2 } is given by 



\P0\\ 2 + ||Ph(u)|| 2 + 2 ( Tr ( P^^ ) - h 1 (u)e ML ) , (47) 



where 9ml = (H T C 1 H)^H T C 1 x is an ML estimate. Here we used the fact that VC 1 u / = 
(H T C" 1 H)tH T C- 1 x. 

We summarize our results for the linear Gaussian model in the following proposition. 

Proposition 1: Let x denote measurements of an unknown parameter vector in the linear Gaussian model 
([3]), where w is a zero-mean Gaussian random vector with covariance C >~ 0. Let h(u) with u = H r C _1 x be 
an arbitrary function of that is weakly differentiable in u and such that 22{|/ij(u)|} is bounded, and let P 
be an orthogonal projection onto 7£(H T ). Then 



E {h r (u)P0} = -E JTr (P^) " h r (u)0 ML } 



where 

ML = (H r C- 1 H)tH T C- 1 x 
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is an ML estimate of 9. An unbiased estimate of the MSE i£{||Ph(u) — P#|| 2 } is 



> , HDU/..MI2 , o/t„Id 91i ( u ) \ UT, 



5(h) = \\P0\\ Z + ||Ph(u)|| 2 + 2 ( Tr I P^-^ J - h 1 (u)0 ML \ . (48) 

C. Examples 

To illustrate the use of the SURE principle, suppose that we consider estimators of the form = a#ML 
where #ml is given by (|41[) . and we would like to select a good choice of a. To this end, we minimize the SURE 
unbiased estimate of the MSE given by Proposition Q] with h(u) = o#ml- Note that in this case h(u) G 7£(H T ) 
so that 5(h) + ||(I — P)#|| 2 is an unbiased estimate of the total MSE £?{||0 — Q\\ 2 } and therefore it suffices to 
minimize S'(h). 

For this choice of h(u), minimizing 5(h) is equivalent to minimizing 



a 2 



The optimal choice of a is 



I^mlII 2 + 2 («Tr ((H r C- x H)t) - a||0 M L|| 2 ) • (49) 



Tr ((H^C-^t) 

Q = 1 



(50) 

ML I 



|0„ T ||2 



resulting in 



V 8ml r / 



The estimate of (|5ip coincides with the balanced blind minimax method proposed in Eq. (45)], which 
was derived based on a minimax framework |8]. Here we see that the same technique results from applying 
our generalized SURE criterion. A striking feature of this estimate, proved in [7], is that when H r C _1 H is 
invertible and its effective dimension is larger than 4, it dominates ML for all values of 9 (see Theorem 3 in 
|_7J). This means that its MSE is always lower than that of the ML method, regardless of the true value of 6. 
When H = I and C = cr 2 I, §3) reduces to 



no 1 



x, (52) 



IN 



which coincides with Stein's estimate [T]. This technique is known to dominate ML for n > 3. 
If in addition we require that a > 0, then the estimate of (|51h becomes 



e 



Tr ((H r C" 1 H)t) 



'ML 



'ML, 



(53) 



(54) 



where we used the notation 

x, x > 0; 

0, x < 0. 

The method of (1531) is a positive-part version of (|5ip . In the iid case, it reduces to the positive-part Stein's 
estimate [35], which is known to dominate the standard Stein approach (152|) . 

Next, consider the case in which H = I and C = D with D = diag (af, ...,&%) and suppose we seek a 
diagonal estimate of the form 9i = Oi%Xi. Minimizing the unbiased estimate of (|48p in this case is equivalent 
to minimizing 

n n n 

^2^xj + 2^2afai-2j2 a i x h (55) 



i=\ 



which yields 



i=l 



Oi = l %. 



1=1 



(56) 



Restricting the coefficients on to be non-negative leads to the estimate 



1-2? 

xf 



■' ■-■/. 



(57) 



In contrast to 6 of ([51 j). which dominates the ML method, it can be proven that the estimate of (|57p is 
not dominating. Thus, we see that by allowing for too many free parameters, we impair the performance 
of the SURE-based estimate. On the other hand, assuming strong structure, as in (|51j) . severely restricts 
the class of estimators and consequently limits the possible performance advantage which can be obtained. 
In Section IVIII we suggest a regularized SURE strategy in order to overcome this inherent tradeoff between 
over-parametrization and performance. 
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VI. Application to Regularization Selection 

A popular strategy for solving inverse problems of the form (|3j) is to use regularization techniques in 
conjunction with a LS objective. Specifically, the estimate is chosen to minimize a regularized LS criterion: 

(x-H0)C- 1 (x-H0)+A||L0|| (58) 

where the norm is arbitrary. Here L is some regularization operator such as the discretization of a first or 
second order differential operator that accounts for smoothness properties of 6, and A is the regularization 
parameter [28], [27]. An important problem in practice is the selection of A, which strongly effects the recovery 
performance. One of the most popular approaches to choosing A when the estimate is linear (as is the case 
when a squared-^2 norm is used in (|58j) ) is the generalized cross-validation (GCV) method |32j. When the 
estimate takes on a more complicated nonlinear form, a popular selection method is the discrepancy principle 
[26]. 

Based on our generalized SURE criterion, we choose A to minimize the SURE objective (|48p . As we 
demonstrate for the cases in which the norm in (|58H is the squared-^2 or l\ norms, this method can dramatically 
outperform GCV and the discrepancy technique in practical applications. 

A. Image Deblurring 

We first consider the case in which the squared-^2 norm is used in (|58p . The solution then has the form 

= (Q + XL T lL)- ll EFC- 1 x, (59) 

where for brevity we denoted 

Q = H T C _1 H. (60) 

The estimate (|59[) is commonly referred to as Tikhonov regularization 
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In the GCV method, A is chosen to minimize 



G(\) 



Tr 2 (I-(Q + AL^L)-iQ)^ 



£(* 



me] 



(61) 



The SURE choice follows directly from minimizing (148 j) . In our simulations below, both minimizations where 
performed by using the fmincon function on Matlab. 

To demonstrate the performance of the proposed regularization method, we tested it in the context of 
image deblurring using the HNO deblurring package for Matlaqfl based on [36]. We chose several test images, 
and blurred them using a Gaussian point-spread function of dimension 9 with standard deviation 6 using 
the function psf Gauss. We then added zero-mean, Gaussian white noise with variance a 2 . In Figs. [Q and 
[2] we compare the deblurred images resulting from using the Tikhonov estimate (|59p with L = I where the 
regularization parameter is chosen according to our new SURE criterion (left) and the GCV method (right), 
for different noise levels. 

As can be seen from the figures, our SURE based approach leads to a substantial performance improvement 
over the standard GCV criterion. This can also be seen in Tables U and [XT] in which we report the resulting 
MSE values. 

TABLE I 

MSE for Tikhonov Deblurring of Lena 





a = 0.01 


a = 0.05 


£7 = 0.1 


GCV 


0.0022 


0.0077 


0.0133 


SURE 


0.0011 


0.0025 


0.0042 



TABLE II 

MSE for Tikhonov Deblurring of Cameraman 





a = 0.01 


a = 0.05 


a = 0.1 


GCV 


0.0033 


0.0121 


0.0221 


SURE 


0.0016 


0.0039 


0.0064 



The package is available at http://www2.imm.dtu.dk/~pch/HN0/. 
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B. Deconvolution Example 

As another application of the SURE, consider the standard deconvolution problem in which a signal 9[£] 
is convolved by an impulse response h[£] and contaminated by additive white Gaussian noise with variance 
a 2 . The observations x[£] can be written in the form of the linear model ([3j) where x is the vector containing 
the observations x[£], 9 consists of the input signal 8[£], and H is a Toeplitz matrix, representing convolution 
with the impulse response h[£]. 

To recover 6[£] we may use a penalized LS approach (|58p where we assume that the original signal 9[£] is 
smooth. This can be accounted for by choosing a penalization of the form ||L0||i where L represents a second 
order derivative operator. The resulting penalized LS estimate can be determined by solving a quadratic 
optimization problem. In our simulations, we used CVX, a package for specifying and solving convex programs 
in Matlab [37] . 

Since the resulting estimate is non-linear, due to the £\ penalization, we cannot apply the GCV equation 
([ST]) . Instead, a popular approach to tune the parameter A is to use the discrepancy principle in which A is 
chosen such that the residual ||x — H#|| 2 is equal to the noise level na 2 |26j . [25] . 

To evaluate the performance of the SURE principle in this context, we consider an example from the 
Regularization Tools [38] for Matlab. All the problems in this toolbox are discretized versions of the Fredholm 
integral equation of the first kind: 

g(s) = [ K(s,t)9(t)dt (62) 

J a 

where K(s,t) is the kernel and 9(t) is the solution for a given g(s). The problem is to estimate 9(t) from 
noisy samples of g(s). Using a midpoint rule with n points, (I62p reduces to an n x n linear system xy = H6. 
The functions in this toolbox differ in K{t, s) and 9{s). Below we consider the function heat(n) with n = 80. 
The output of the function is the matrix H and the true vector 9 (which represents 6(t)). The observations 
are x = xy + w where w is a white Gaussian noise vector with variance a 2 = 1. 

In Fig. [3] we plot the original signal along with the observations x, and the clean convolved signal x^ = HO. 
The original signal along with the estimates using the SURE principle and the discrepancy method are plotted 
in Fig. [TJ To evaluate the gradient of the estimate, we used the Monte-Carlo SURE approach proposed in 
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[39j . Evidently, the SURE method leads to superior performance. The MSE using the SURE approach in this 
example is 0.10 while the discrepancy strategy leads to an MSE of 1.16. 

VII. Regularized SURE Method 

A crucial element in guaranteeing success of the SURE method is to choose a good parameterization of 
h(u). However, in many contexts, such a structure may be hard to find. On the other hand, letting the SURE 
criterion select many free parameters can deteriorate its performance. One way to treat this inherent tradeoff 
is by regularization. Thus, instead of minimizing the SURE objective we suggest minimizing a regularized 
version: 

5(h,A) = 5(h) + Ar(h(u)) (63) 

where A is a regularization parameter and r(h(u)) is a regularization function. For example, we may choose 
r(v) = ||v|| where the norm is arbitrary. The parameter A is determined by applying the conventional SURE 
(|29p (or (I38p ) to the estimate h(u, A) resulting from solving (j63[) with A fixed. 

As an example, consider the iid Gaussian model in which x = + w where w is a Gaussian noise vector 
with iid zero- mean components of variance a 2 . Assuming that 6 represents the wavelet coefficients of some 
underlying signal x, a popular estimation strategy is wavelet denoising in which each component of x is 
replaced by a soft or hard-thresholded version. In particular, in their landmark paper, Donoho and Johnstone 
|llj developed a soft-threshold wavelet denoising method in which 

\^i\ £> \%i 1 — *: 

(64) 
0, \xi\ < t, 

where t is a threshold value. They suggest selecting t to minimize the SURE criterion, and refer to the resulting 
estimate as SureShrink (to be more precise, in SureShrink t is determined by SURE only if it lower than some 
upper limit). In developing the SureShrink method, the function h(x) is restricted to be a component- wise 
soft threshold. The motivation for this choice is that the wavelet coefficients below a certain level tend to be 
sparse. It is well known that soft-thresholding can be obtained as the solution to a LS criterion with an l\ 
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penalty: 

min{||x-0|| 2 + A||0||i}. (65) 

Thus, in principle we can view the SureShrink approach as a 2-step procedure: We first determine the estimate 
that minimizes an i\ penalized LS criterion. We then choose the penalization factor to minimize SURE. 

Instead, we suggest choosing an estimate that directly minimizes an t\ regularized SURE objective, where 
the only assumption we make is that the processing is performed component wise. Thus, 9i = o^xi for some 
coefficients aj(x) > 0. Since u = (l/a 2 )xi, /ij(u) = a 2 aiUi. With this choice of h(u), minimizing ([63]) is 
equivalent to minimizing the following objective: 

n n n 

S{a,X) = ^afx 2 + 2^2ai (a 2 - x 2 ) + X^ NN- (66) 



The optimal choice of a, > is 

a 2 + \\xi 



a; 



1 



A 



(67) 



The resulting estimate can be viewed as a soft-thresholding method, with a particular choice of shrinkage 
(different than the standard approach (|64p ) when the value of X{ exceeds the threshold. The precise threshold 
value is equal to the largest value Xi for which a» = and is given by 

t = i (A + VA 2 + 4a 2 ) . (68) 



To choose A, we substitute the estimate Q{ = Oi(X)Xi with aj(A) given by (f67l) into the SURE criterion (|48|) 
with P = I, and minimize with respect to A. The value of A can be easily determined numerically. 

To demonstrate the advantage of our method over conventional soft-thresholding we implemented our 
approach on the examples taken from [11]. Specifically, we used the test functions Blocks, Bumps, HeaviSine 
and Doppler defined in [11]. The length of all signals is 2048 and the noise variance is a 2 = 4. We used the 
Daubechies 8 symmetrical wavelet, and L = 5 levels are considered. In Table Hill we report the empirical MSE 
values of the original noisy signals, and 3 wavelet denoising schemes: SureShrink which is the method of [11] 
with the threshold selected using SURE, our proposed regularized SURE method (RSURE), and OracleShrink 
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which is a soft-threshold where the threshold value is selected to minimize the squared-error between the true 
unknown wavelet coefficient, and its denoised version. Clearly this approach is only for comparison and serves 
as a benchmark on the best possible performance that can be obtained using any soft threshold. As can 

TABLE III 

MSE for Different Soft Denoising Schemes 





Blocks 


Bumps 


HeaviSine 


Doppler 


Original 


4.054 


4.072 


4.153 


3.945 


SureShrink 


0.744 


0.875 


0.205 


0.290 


RSure 


0.694 


0.816 


0.169 


0.273 


OracleShrink 


0.690 


0.828 


0.118 


0.283 



be seen from the table, the regularized SURE method performs better in all cases than SureShrink. It is 
also interesting to see that it sometimes even outperforms OracleShrink which is based on the true unknown 
0. The reason the performance can be better than the oracle is that the shrinkage performed in RSURE is 
different than the conventional soft threshold. 

In Table |IV] we repeat our experiments where now we use the estimates resulting from the standard SURE 
criterion. Specifically, we consider the positive-part Stein estimate (|51j) referred to as SteinShrink and the 
estimate (|57p which we refer to as ScalarShrink. Evidently, using the SURE estimate without regularization 

TABLE IV 

MSE for Different Denoising Schemes 





Blocks 


Bumps 


HeaviSine 


Doppler 


ScalarShrink 


1.043 


1.362 


0.161 


0.594 


SteinShrink 


1.681 


1.730 


1.508 


1.413 



deteriorates the performance significantly. Thus, SURE alone is not generally sufficient to obtain good esti- 
mates. However, adding regularization dramatically improves the behavior without the need to pre-specify 
the desired structure. 

Finally, in Table IVl we repeat the experiments of Table HTT1 to determine the threshold values, but once the 
values are found we apply hard-thresholding on the coefficients. As can be seen from the table, even though 
the thresholding operation is now the same in both methods, RSURE performs significantly better. Thus, 
the threshold determined from this method is superior to that resulting from the SURE criterion without 
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TABLE V 
MSE for Different Hard-Thresholding Schemes 





Blocks 


Bumps 


HeaviSine 


Doppler 


SureShrink 


1.902 


1.961 


0.988 


0.630 


RSure 


1.560 


1.912 


0.766 


0.700 



regularization. Here again the importance of regularization is demonstrated. 

VIII. Conclusion 

In this paper, we developed an unbiased estimate of the MSE in multivariate exponential families by ex- 
tending the SURE method. This generalized principle can now be used in exponential multivariate estimation 
problems to develop estimators with improved performance over existing approaches. As an application, we 
suggested a new strategy for choosing the regularization parameter in penalized inverse problems. We demon- 
strated via several examples that this method can significantly improve the MSE over the standard GCV and 
discrepancy approaches. We also suggested a regularized SURE criterion for selecting estimators without the 
need for pre-specifying their structure. Applying this objective in the context of wavelet denoising, we pro- 
posed a new type of soft-thresholding which minimizes a penalized estimate of the MSE. As we demonstrated, 
this strategy can lead to improved MSE behavior in comparison with soft and hard thresholding methods. 

The main contribution of this work is in introducing the generalized SURE criterion and the regularized 
SURE method and demonstrating their applicability in several examples. In future work, we intend to develop 
these applications in more detail and further explore the practical use of the proposed design objectives. 
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(a) 



(b) 





(c) 



(d) 





(e) 



(f) 



Fig. 1 
Deblurring of Lena using Tikhonov regularization with SURE (left) and GCV (right) choices of 

REGULARIZATION AND DIFFERENT NOISE LEVELS: (a), (b) a = 0.01 (c),(d) a — 0.05 (e),(f) a = 0.1. 
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(a) 



(b) 





(c) 



(d) 





(e) 



(f) 



Fig. 2 
Deblurring of Cameraman using Tikhonov regularization with SURE (left) and GCV (right) 

CHOICES OF REGULARIZATION AND DIFFERENT NOISE LEVELS: (a), (b) a = 0.01 (c),(d) a = 0.05 (e),(f) <J = 0.1. 
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Fig. 3 
The original signal (dashed), the clean convolved signal (star) and the observations x with a = 1. 




- Discrepancy 

-SURE 

-True 



Fig. 4 
Deconvolution using weighted £i regularization with the discrepancy principle, SURE (star) and 

the true signal (dashed) with a = 1. 



